Confidence Intervals – Real Life Sampling Distributions

What if I only have one sample?

Approximate the variability you’d expect to see in other samples!

Bootstrapping!

A Bootstrap Resample

  • Assumes the original sample is “representative” of observations in the population.
  • Uses the original sample to generate new samples that might have occurred with additional sampling.


We can use the statistics from these bootstrap samples to approximate the true sampling distribution!

Why???

Estimating a population parameter

  • We are interested in knowing how a statistic varies from sample to sample.
  • Knowing a statistic’s behavior helps us make better / more informed decisions!
  • This helps us judge which values of the population parameter are more or less plausible.

Confidence Intervals

Capture a range of plausible values for the population parameter.


Are more likely to capture the population parameter than a point estimate.

Using bootstrap resamples to generate a confidence interval

From your original sample, draw observations with replacement until you have as many as the original sample contains.

This is your bootstrap resample.

Repeat this process many, many times.

Calculate a numerical summary (e.g., mean, median) for each bootstrap resample.

These are your bootstrap statistics.
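
The steps above can be sketched in base R. This is a minimal illustration, not the infer workflow used later: the values in original_sample are made up, and the mean is just one possible choice of statistic.

```r
# Hypothetical sample of 8 measurements (made-up values, for illustration only)
original_sample <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1)
n <- length(original_sample)

set.seed(1)  # make the random resamples reproducible

# One bootstrap resample: draw n observations, with replacement
one_resample <- sample(original_sample, size = n, replace = TRUE)

# Repeat many times, computing the statistic (here, the mean) each time
boot_means <- replicate(
  5000,
  mean(sample(original_sample, size = n, replace = TRUE))
)

length(boot_means)  # one bootstrap statistic per resample
```

Each entry of boot_means is one bootstrap statistic; together they form the bootstrap distribution.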

Bootstrap Distribution

Definition: the distribution of the bootstrap statistics, one from each bootstrap resample.


Displays the variability in the statistic that could have happened with repeated sampling.

Approximates the true sampling distribution!

Confidence Interval

Goal: Capture a range of plausible values for the population parameter.

How do I get this plausible range of values?


Bootstrapping!

Penguins!

Parameter: \(\beta_1\)

The slope of the relationship between penguins’ bill length and body mass for all penguins in the Palmer Archipelago

Generating a bootstrap resample

Step 1: specify() your response and explanatory variables

Step 2: generate() bootstrap resamples

Step 3: calculate() the statistic of interest

Declare your variables!

penguins %>% 
  specify(response = bill_length_mm, explanatory = body_mass_g)

Generate your resamples!

penguins %>% 
  specify(response = bill_length_mm, 
          explanatory = body_mass_g) %>% 
  generate(reps = 1, type = "bootstrap")


reps – the number of resamples you want to generate

"bootstrap" – the method that should be used to generate the new samples

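To see what a single bootstrap resample of a data frame looks like, here is a rough base R sketch. It assumes the resampling is done on whole rows, drawn with replacement; the toy data frame is a made-up stand-in, not the real penguins data.

```r
# Hypothetical data frame standing in for the penguins data (made-up values)
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, 46.5, 50.0, 45.2),
  body_mass_g    = c(3750, 3800, 3250, 4500, 5700, 4300)
)

set.seed(2)  # make the random resample reproducible

# Pick row numbers with replacement, as many as there are rows
rows <- sample(nrow(toy), size = nrow(toy), replace = TRUE)

# Keep whole rows so each bill length stays paired with its body mass
resample <- toy[rows, ]
resample
```

Some rows will usually show up more than once and others not at all. That is what sampling with replacement looks like.
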
Your turn!


Why do we resample with replacement when creating a bootstrap distribution?


When we resample with replacement from our original sample what are we assuming about our sample?

Calculate your statistics!

penguins %>% 
  specify(response = bill_length_mm, 
          explanatory = body_mass_g) %>% 
  generate(reps = 1, 
           type = "bootstrap") %>% 
  calculate(stat = "slope")


"slope" – the statistic of interest

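Under the hood, a slope statistic is the fitted slope from a simple linear regression on each resample. A minimal base R sketch of that idea follows, assuming an ordinary least squares fit with lm(); the toy data and the boot_slope() helper are hypothetical and only for illustration.

```r
# Hypothetical data frame standing in for the penguins data (made-up values)
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 46.5, 50.0, 45.2, 48.7, 41.1, 47.6),
  body_mass_g    = c(3750, 3800, 3250, 3450, 4500, 5700, 4300, 5200, 3500, 4900)
)

set.seed(3)  # make the random resamples reproducible

# Hypothetical helper: resample rows with replacement, then return the fitted slope
boot_slope <- function(data) {
  rows <- sample(nrow(data), size = nrow(data), replace = TRUE)
  fit <- lm(bill_length_mm ~ body_mass_g, data = data[rows, ])
  unname(coef(fit)["body_mass_g"])
}

# One slope per bootstrap resample
boot_slopes <- replicate(1000, boot_slope(toy))
head(boot_slopes)
```
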
The final product

# boot1 holds the bootstrap slopes from 5,000 resamples (the pipeline above with reps = 5000)
visualize(boot1) + 
  labs(title = "Bootstrap Distribution of 5,000 reps", 
       x = "Slope Statistic")

What does one dot / point on a bootstrap distribution represent?

A plausible range of values for: \(\beta_1\)

visualize(boot1) +
  shade_confidence_interval(endpoints = boot1_CI, 
                            color = "red", fill = "pink") +  
  labs(title = "Bootstrap Distribution of 5,000 reps", 
       x = "Slope Statistic")

The 95% confidence interval is…

boot1_CI <- get_confidence_interval(boot1, 
                                    level = 0.95, 
                                    type = "percentile")
boot1_CI


Lower Bound   Upper Bound
0.00355       0.00453
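
A percentile interval takes the middle 95% of the bootstrap distribution: the 2.5th and 97.5th percentiles of the bootstrap statistics. A rough base R sketch is below. It uses a made-up sample and the mean as the statistic, so the numbers it prints are unrelated to the penguin interval above.

```r
# Hypothetical sample (made-up values, for illustration only)
original_sample <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8)

set.seed(4)  # make the random resamples reproducible

# Bootstrap distribution of the sample mean
boot_means <- replicate(5000, mean(sample(original_sample, replace = TRUE)))

# Middle 95%: cut off 2.5% in each tail
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

For a 90% interval, the cutoffs would be 0.05 and 0.95 instead.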


What do we hope is captured by this interval?

How do we interpret this interval?

“We are 95% confident the slope of the relationship between bill length and body mass for all penguins in the Palmer Archipelago is between 0.00355 and 0.00453.”

What does it mean to be 95% confident?
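
One way to explore this question is a simulation in which the population parameter is known. The sketch below is built on assumptions: a hypothetical normal population with mean 50, samples of size 40, and percentile intervals for the mean. It counts how often the interval captures the true mean.

```r
set.seed(5)      # make the simulation reproducible
true_mean <- 50  # hypothetical population mean, known only because we simulate

# For each of 200 samples: build a bootstrap percentile interval, check for capture
captures <- replicate(200, {
  one_sample <- rnorm(40, mean = true_mean, sd = 10)
  boot_means <- replicate(500, mean(sample(one_sample, replace = TRUE)))
  ci <- quantile(boot_means, probs = c(0.025, 0.975))
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(captures)  # proportion of intervals that captured the true mean
```

The proportion should land near 0.95, though percentile intervals from modest samples often come in a little under. The 95% describes how often the method succeeds across many samples, not any single interval.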

Classic interpretation mistakes


“95% of the time the population parameter would fall between 0.00355 and 0.00453.”


“We are 95% confident the sample statistic is in our interval.”